Accurate floating point summation∗
Abstract
We present and analyze several simple algorithms for accurately summing $n$ floating point numbers, $S = \sum_{i=1}^{n} s_i$, independent of how much cancellation occurs in the sum. Let $f$ be the number of significant bits in the $s_i$. We assume a register is available with $F > f$ significant bits. Assuming that (1) $n \le \lfloor 2^{F-f}/(1 - 2^{-f}) \rfloor + 1$, (2) rounding is to nearest, (3) no overflow occurs, and (4) all underflow is gradual, simply summing the $s_i$ in decreasing order of magnitude yields $S$ rounded to within just over 1.5 units in its last place. If $S = 0$, then it is computed exactly. If we increase $n$ slightly to $\lfloor 2^{F-f}/(1 - 2^{-f}) \rfloor + 3$, then all accuracy can be lost. This result extends work of Priest and others who considered double precision only ($F \ge 2f$). We apply this result to the floating point formats in the (proposed revision of the) IEEE floating point standard. For example, a dot product of IEEE single precision vectors $\sum_{i=1}^{n} x_i \cdot y_i$ computed using double precision and sorting is guaranteed correct to nearly 1.5 ulps as long as $n \le 33$. If double extended is used, $n$ can be as large as 65537. We also show how sorting may be mostly avoided while retaining accuracy.
∗Computer Science Division Technical Report UCB//CSD-02-1180, University of California, Berkeley, 94720. This research was supported in part by LLNL Memorandum Agreement No. B504962 under the Department of Energy Contract No. W-7405-ENG-48 and DOE Grant No. DE-FG03-94ER25219, the National Science Foundation under Grant No. ASC-9813362, NSF Cooperative Agreement No. ACI-9619020, NSF Infrastructure Grant No. EIA-9802069, the National Science Foundation Graduate Research Fellowship, and by a gift from Intel. The information presented here does not necessarily reflect the position or the policy of the Government and no official endorsement should be inferred.
†Computer Science Division and Mathematics Department, University of California, Berkeley, CA 94720 ([email protected]).
‡Computer Science Division, University of California, Berkeley, CA 94720 ([email protected]).
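The bound in condition (1) and the "sort, then sum in a wider register" idea from the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names (`max_terms`, `to_single`, `sorted_sum`, `dot_single_in_double`) are hypothetical, Python's `float` is assumed to be IEEE binary64 ($F = 53$), and the products of two single precision values are treated as the summands with $f = 48$ significant bits, an assumption that is consistent with the $n \le 33$ figure quoted above. Overflow and underflow conditions are not checked.

```python
from fractions import Fraction
import math
import struct


def max_terms(F: int, f: int) -> int:
    """Condition (1) from the abstract: floor(2^(F-f) / (1 - 2^-f)) + 1.

    Exact rational arithmetic is used so the floor is not affected by rounding.
    """
    bound = Fraction(2) ** (F - f) / (1 - Fraction(1, 2 ** f))
    return math.floor(bound) + 1


def to_single(x: float) -> float:
    """Round a binary64 value to the nearest binary32 value (hypothetical helper)."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sorted_sum(terms):
    """Sum the terms in decreasing order of magnitude in a binary64 accumulator."""
    acc = 0.0
    for t in sorted(terms, key=abs, reverse=True):
        acc += t
    return acc


def dot_single_in_double(xs, ys):
    """Dot product of single precision vectors, accumulated in double precision.

    Each product of two 24-bit significands fits in 48 bits, so it is
    represented exactly in binary64 before the sorted summation starts.
    """
    products = [to_single(x) * to_single(y) for x, y in zip(xs, ys)]
    limit = max_terms(53, 48)  # assumption: F = 53, f = 48
    if len(products) > limit:
        raise ValueError(f"n = {len(products)} exceeds the bound n <= {limit}")
    return sorted_sum(products)


print(max_terms(53, 48))  # 33: single precision products summed in double
print(max_terms(64, 48))  # 65537: same products summed in double extended
```

The two printed values reproduce the limits stated in the abstract. The length check in `dot_single_in_double` is a design choice of this sketch: it refuses inputs outside the range where the abstract's guarantee applies rather than silently returning a possibly inaccurate result.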
Similar resources
Accurate floating-point summation: a new approach
The aim of this paper is to find an accurate and efficient algorithm for evaluating the summation of large sets of floating-point numbers. We present a new representation of the floating-point number system in which a number is represented as a linear combination of integers and the coefficients are powers of the base of the floating-point system. The approach allows to build up an accurate flo...
Group-Alignment based Accurate Floating-Point Summation on FPGAs
Floating-point summation is one of the most important operations in scientific/numerical computing applications and also a basic subroutine (SUM) in BLAS (Basic Linear Algebra Subprograms) library. However, standard floating-point arithmetic based summation algorithms may not always result in accurate solutions because of possible catastrophic cancellations. To make the situation worse, the seq...
Error-free transformations in real and complex floating point arithmetic
Error-free transformation is a concept that makes it possible to compute accurate results within a floating point arithmetic. Up to now, it has only been studied for real floating point arithmetic. In this short note, we recall the known error-free transformations for real arithmetic and we propose some new error-free transformations for complex floating point arithmetic. This will make it possib...
Accurate Sum and Dot Product
Algorithms for summation and dot product of floating point numbers are presented which are fast in terms of measured computing time. We show that the computed results are as accurate as if computed in twice or K-fold working precision, K ≥ 3. For twice the working precision our algorithms for summation and dot product are some 40% faster than the corresponding XBLAS routines while sharing simi...
Accurate summation, dot product and polynomial evaluation in complex floating point arithmetic
Article history: Available online 30 March 2012
Twofold fast summation
Debugging the accumulation of floating-point errors is hard; ideally, the computer should track it automatically. Here we consider a twofold approximation of an exact real by a value + error pair of floating-point numbers. Normally, the value + error sum is more accurate than the value alone, so the error can estimate the deviation between the value and its exact target. A fast summation algorithm that provides a twofold sum of ∑...
Publication date: 2002